Exploring Red Wine Quality

Introduction

This project explores and analyzes the quality of red wine. The underlying objective is to understand which chemical properties influence red wine quality. The exploratory data analysis is carried out in the statistical program R; the dataset can be found here and additional literature on the variables can be found here.

Univariate Plots Section

The following are some basic statistics on the dataset and the quality variable.

# Summary Statistics
str(wq)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
summary(wq)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
summary(wq$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The dataset holds 1,599 wine observations across 13 numeric variables. X appears to be a unique identifier and quality is the primary output. Quality is scored on a 10-point scale and was rated by at least three wine experts. Interestingly, the observed scores range only from 3 to 8, with a mean of 5.6 and a median of 6. Because the scores are whole numbers with a natural order, quality is an ordinal, discrete variable.

table(wq$quality)
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The following are histogram plots for the 12 variables to kick off the data visualizations.

Univariate Analysis

What is the structure of your dataset?

There are 1,599 wine observations across 13 numeric variables where X is the unique identifier and fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality are the 12 features.

The first 11 variables are physicochemical measurements on the wine samples, and quality is a 10-point scale output based on sensory data from at least three wine experts.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality. From the Univariate Plots Section, it can be observed that quality follows a near-normal distribution where the bulk of the observations are in the 5-6 range, with a few wines at either extreme. This can be further outlined by using a coarser rating variable, such that a quality score of 0-4 denotes a Poor wine, a score of 5-6 denotes an Average wine, and a score of 7 or more denotes a Good wine.

##    Poor Average    Good 
##      63    1319     217
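
The bucket counts above can be reproduced from the quality frequency table shown earlier. The following is a minimal sketch, assuming the rating was built with `cut()`; the vector name `quality`, the break points, and the choice of an ordered factor are illustrative assumptions rather than the code actually used in this analysis.

```r
# Rebuild the quality scores from the frequency table printed earlier
# (10, 53, 681, 638, 199 and 18 wines scoring 3 through 8).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Bucket the scores: 0-4 Poor, 5-6 Average, 7 and above Good.
# With the default right = TRUE the intervals are (-Inf, 4], (4, 6], (6, Inf).
# An ordered factor is an assumption; it keeps Poor < Average < Good in plots.
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("Poor", "Average", "Good"),
              ordered_result = TRUE)

table(rating)
```

Using `-Inf` and `Inf` as the outer breaks means any score on the 10-point scale falls into a bucket, even those not observed in this dataset.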

Throughout this exploratory data analysis, the drivers of quality will be unearthed and examined.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Similar to quality, density and pH seem to be normally distributed. Fixed and volatile acidity, free and total sulphur dioxide, sulphates, and alcohol seem to be skewed and long-tailed. It is not yet clear which features directly affect quality, but from some high-level research, it appears that alcohol content, acidity and pH might be contributors.

Further research failed to highlight any difference in benefit between the different types of acidity in wine. Thus, for the purpose of this project, fixed acidity (tartaric acid), volatile acidity (acetic acid) and citric acid were combined into a variable named acidity. It should also be noted that the presence of sulphur dioxide and sulphates indicates the presence of sulphuric acid; this is ignored as being beyond the scope of this project.

Did you create any new variables from existing variables in the dataset?

A new variable, rating, was defined that categorizes the wine quality scores into Poor, Average, and Good buckets to illustrate their near-normal distribution. A second key variable, acidity, was defined as the sum of fixed acidity, volatile acidity and citric acid. It is hypothesized that acidity is a driver of wine quality.
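
As a small illustration of the combined acidity variable, the sketch below sums the three acid columns for the first four rows shown in the `str(wq)` output. The data frame name `wq_head` is a hypothetical stand-in for the full `wq` dataset.

```r
# First four observations of the three acid columns, copied from str(wq).
# wq_head is a hypothetical stand-in for the full dataset.
wq_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56)
)

# Combined acidity is the row-wise sum of the three acid measurements.
wq_head$acidity <- with(wq_head,
                        fixed.acidity + volatile.acidity + citric.acid)

wq_head$acidity
```

A plain sum treats the three acids as interchangeable, which is exactly the simplifying assumption stated above.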

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distribution of citric acid is fairly unusual. On a logarithmic scale, fixed acidity and volatile acidity are roughly normally distributed, much like pH, but citric acid is not. Citric acid has a large number of zero values (the summary statistics show a minimum of 0 and no missing values), which could reflect incomplete or unavailable data.
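
One way to check what is unusual about citric acid is to count missing values and exact zeros separately, and then see what a log10 transform does to the zeros. The sketch below uses only the first ten citric acid values from the `str(wq)` output; the variable names are illustrative.

```r
# First ten citric acid values, copied from the str(wq) output.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

n_missing <- sum(is.na(citric))   # true missing values (NA)
n_zero    <- sum(citric == 0)     # recorded values that are exactly zero

# log10(0) is -Inf, so every zero becomes a non-finite value on a log scale
# and cannot be placed in a log-scale histogram bin.
log_citric  <- log10(citric)
n_nonfinite <- sum(!is.finite(log_citric))

c(missing = n_missing, zero = n_zero, nonfinite_after_log = n_nonfinite)
```

In this small excerpt half of the values are zero and none are missing, which is why the log-scale view of citric acid behaves differently from the other acids.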

The dataset in general was fairly tidy such that additional wrangling was not needed.

Bivariate Plots Section

The bivariate plots begin with a scatterplot matrix. Unfortunately, due to the large file size, generating such a plot took much too long. Instead, a sample of the dataset was used to begin the exploration.
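
Drawing the sample can be sketched as follows. The seed, the sample size of 300 and the stand-in data frame `wq_standin` are all assumptions for illustration; the analysis does not state which values were used.

```r
# Hypothetical stand-in with the same number of rows as wq; the quality
# column is rebuilt from the frequency table shown earlier.
wq_standin <- data.frame(
  X       = 1:1599,
  quality = rep(3:8, times = c(10, 53, 681, 638, 199, 18))
)

set.seed(1599)  # arbitrary seed so the same rows are drawn on every run

# Draw 300 distinct row indices (sampling without replacement).
sample_rows <- sample(nrow(wq_standin), size = 300)
wq_sample   <- wq_standin[sample_rows, ]

dim(wq_sample)
```

The sampled data frame can then be passed to a scatterplot matrix function in place of the full dataset.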

The boxplots on rating and some of the correlations seem noteworthy. They were subsequently explored.

These boxplots provided some very interesting insights. It appears that fixed acidity, citric acid, sulphates and alcohol are positively correlated with wine quality, while volatile acidity and pH are negatively correlated. The difference in behavior between the acids does bring into question the decision to use a combined acidity variable, but a better assessment will be made in a subsequent section.
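
The comparison a boxplot makes visually can also be summarized numerically as a median per rating group. The sketch below does this for the first ten rows shown in the `str(wq)` output; ten rows are far too few to draw conclusions and contain no Poor wines, so this only illustrates the mechanics. The names `wq_head` and `medians` are hypothetical.

```r
# First ten observations of three columns, copied from str(wq).
wq_head <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Same Poor / Average / Good buckets as described in the univariate section.
wq_head$rating <- cut(wq_head$quality,
                      breaks = c(-Inf, 4, 6, Inf),
                      labels = c("Poor", "Average", "Good"))

# Median of each feature within each rating group (the centre line of a boxplot).
medians <- aggregate(cbind(volatile.acidity, alcohol) ~ rating,
                     data = wq_head, FUN = median)
medians
```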

##                    X        fixed.acidity     volatile.acidity 
##           0.06645261           0.12405165          -0.39055778 
##          citric.acid       residual.sugar            chlorides 
##           0.22637251           0.01373164          -0.12890656 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##          -0.05065606          -0.18510029          -0.17491923 
##                   pH            sulphates              alcohol 
##          -0.05773139           0.25139708           0.47616632 
##              quality               rating              acidity 
##           1.00000000           0.81236704           0.10375373
##                    X        fixed.acidity     volatile.acidity 
##           0.11527163           0.11423756          -0.39124918 
##          citric.acid       residual.sugar            chlorides 
##                  NaN           0.02353331          -0.17613996 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##          -0.05008749          -0.17014272          -0.17517368 
##                   pH            sulphates              alcohol 
##          -0.05757386           0.30864193           0.47698109 
##              quality               rating              acidity 
##           0.97556915           0.79200148           0.09282597

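Correlations with quality were computed on both a plain and a logarithmic (log10) scale; the first block above is the plain scale and the second is the logarithmic scale. As expected, citric acid, alcohol and, to a lesser extent, fixed acidity had a positive correlation with quality, while volatile acidity had a negative correlation. Interestingly, sulphates appeared to have a stronger correlation on a logarithmic scale, and pH seemed to be hardly correlated at all. The log-scale value for citric acid is NaN because the logarithm of its zero values is not finite.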
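
The two sets of correlations can be sketched as below, assuming each feature was correlated with the untransformed quality score and that only the feature was log-transformed in the second pass. The example uses the first ten rows from the `str(wq)` output, so the coefficients will not match the full-dataset values printed above; `wq_head`, `features`, `plain_cor` and `log_cor` are illustrative names.

```r
# First ten observations of selected columns, copied from str(wq).
wq_head <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates   = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol     = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality     = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
features <- c("citric.acid", "sulphates", "alcohol")

# Pearson correlation of each feature with quality on the plain scale.
plain_cor <- sapply(wq_head[features],
                    function(col) cor(col, wq_head$quality))

# Same correlation after a log10 transform of the feature only.
# Zeros become -Inf, which makes the coefficient NaN for citric acid.
log_cor <- suppressWarnings(
  sapply(wq_head[features],
         function(col) cor(log10(col), wq_head$quality))
)

rbind(plain = plain_cor, log10 = log_cor)
```

If a log-scale coefficient for citric acid were needed, the zero values would have to be handled explicitly first; that choice is left open here.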

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

From the boxplots, it appears that fixed acidity, citric acid, sulphates and alcohol are positively correlated with wine quality, while volatile acidity and pH are negatively correlated. The correlation coefficients showed similar trends, with two exceptions: pH had a coefficient of only about -0.06, and sulphates had a stronger coefficient of about 0.31 on a logarithmic scale.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The acidity and sulphur dioxide relationships were examined.

There seems to be a trend between fixed acidity and citric acid, and between volatile acidity and citric acid, but oddly there seems to be no relationship between fixed acidity and volatile acidity. This could be because the underlying chemistry of the two is independent.

As a purely positive control test, the logarithmic relationship between acidity and pH was examined.

##        cor 
## -0.7044435

As expected, higher acidity goes with lower pH, with a correlation coefficient of about -0.70.
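
The `cor` label in the output above is what `cor.test()` reports as its estimate. A minimal sketch of this control test follows, assuming acidity was log10-transformed and pH was left as is (pH is already a logarithmic measure). It uses the first ten rows from the `str(wq)` output, so the coefficient will differ from the full-dataset value.

```r
# First ten observations, copied from str(wq).
fixed    <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
volatile <- c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)
citric   <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
pH       <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)

# Combined acidity, as defined in the univariate analysis.
acidity <- fixed + volatile + citric

# Pearson correlation test between log10(acidity) and pH.
acid_ph <- cor.test(log10(acidity), pH, method = "pearson")

acid_ph$estimate  # named "cor"; expected to be negative
```

Unlike `cor()`, `cor.test()` also returns a p-value and confidence interval, which would be useful for the inferential follow-up suggested in the Reflection.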

The relationship between free and total sulphur dioxide was also investigated.

##       cor 
## 0.6676665

A correlation coefficient of about 0.67 indicates a fairly strong relationship between the two sulphur dioxide measures. Some research indicates that sulphur dioxide is used as an antimicrobial in winemaking and that free sulphur dioxide is one component of the total.

What was the strongest relationship you found?

The strongest relationships to quality, expressed as correlation coefficients, were as follows:
- alcohol: 0.476
- sulphates (log10): 0.309
- citric acid: 0.226
- fixed acidity: 0.124
- volatile acidity: -0.391

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

For the multivariate plots, the features that bore the strongest relationship to quality were examined by coloring the points by quality score and faceting the plots by the three rating categories. It can be noted that higher alcohol, sulphates, citric acid, and fixed acidity, together with lower volatile acidity, are associated with better wine quality. This is in line with the insights uncovered thus far.

Were there any interesting or surprising interactions between features?

Since alcohol, specifically ethanol, is a weak acid, it was thought to be somewhat correlated with the presence of other acids, such as citric acid. The plot of alcohol against citric acid above clearly shows their lack of correlation.

To close off the discussion around pH, it can be visually observed not to be a driver of wine quality when compared with the very obvious alcohol variable. It should be noted, though, that pH depends on the concentration of acids in wine and does not vary far from the 3-4 range.


Final Plots and Summary

From the numerous plots above, it can be found that acidity, alcohol content and sulphates contribute to good wines. The final plots will illustrate these findings.

Plot One: Acidity on wine quality

It can be noted that not all acids are created equal. These boxplots illustrate that higher fixed acidity (tartaric acid) and citric acid are found in better quality wines. Furthermore, lower volatile acidity (acetic acid) also goes with higher wine quality. Therefore, a lower pH alone would be a red herring for wine quality. After all, a higher acid concentration leads to a lower pH value, but only tartaric and citric acid seem to benefit wine quality.

Plot Two: What is wine if it can’t get you drunk?

These boxplots show a trend of higher wine quality ratings with higher alcohol content. While it is likely that a higher caliber wine would have a higher percentage of alcohol, additional experimentation is needed to support causation, given the presence of outliers in the Average category.

Plot Three: Putting sulphates into perspective with alcohol content

This final plot illustrates that good wines tend to have both high sulphates and high alcohol content. The dotted lines represent the mean of each axis, and the top right quadrant has a high density of Good wine ratings.


Reflection

Exploratory data analysis proved to be very effective in understanding relationships within the red wine quality dataset. There were no notable struggles encountered throughout this analysis. It was found that fixed acidity, citric acid, alcohol content and sulphates are positively associated with wine quality, and that volatile acidity is negatively associated with it. Boxplots seemed to be the most telling visualization for this dataset.

It should be noted, though, that wine quality is highly subjective and depends on an individual's taste; a better study would also include the quantities of each wine sold in the market. Further analysis using inferential statistics and similar methodologies should be used to verify the findings of this exploration. Nevertheless, the plots here did uncover an interesting and telling story of wine quality in the available observations.


References